Author: Igor Matheus S. Moreira
Data set: Brazilian houses to rent
This work aims at answering the following three hypotheses:
This notebook was produced using python $3.7.6$ (with conda) and has the following requirements:
altair == $4.1.0$numpy == $1.19.2$pandas == $1.2.3$scipy == $1.5.2$import altair as alt
import numpy as np
import pandas as pd
from scipy import stats
%%capture
alt.data_transformers.disable_max_rows()
The data set used herein is brazilian_houses_to_rent. In particular, the houses_to_rent_v2.csv file is used.
houses = pd.read_csv("houses_to_rent_v2.csv")
houses.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10692 entries, 0 to 10691 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city 10692 non-null object 1 area 10692 non-null int64 2 rooms 10692 non-null int64 3 bathroom 10692 non-null int64 4 parking spaces 10692 non-null int64 5 floor 10692 non-null object 6 animal 10692 non-null object 7 furniture 10692 non-null object 8 hoa (R$) 10692 non-null int64 9 rent amount (R$) 10692 non-null int64 10 property tax (R$) 10692 non-null int64 11 fire insurance (R$) 10692 non-null int64 12 total (R$) 10692 non-null int64 dtypes: int64(9), object(4) memory usage: 1.1+ MB
houses.head()
| city | area | rooms | bathroom | parking spaces | floor | animal | furniture | hoa (R$) | rent amount (R$) | property tax (R$) | fire insurance (R$) | total (R$) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | São Paulo | 70 | 2 | 1 | 1 | 7 | acept | furnished | 2065 | 3300 | 211 | 42 | 5618 |
| 1 | São Paulo | 320 | 4 | 4 | 0 | 20 | acept | not furnished | 1200 | 4960 | 1750 | 63 | 7973 |
| 2 | Porto Alegre | 80 | 1 | 1 | 1 | 6 | acept | not furnished | 1000 | 2800 | 0 | 41 | 3841 |
| 3 | Porto Alegre | 51 | 2 | 1 | 0 | 2 | acept | not furnished | 270 | 1112 | 22 | 17 | 1421 |
| 4 | São Paulo | 25 | 1 | 1 | 0 | 1 | not acept | not furnished | 0 | 800 | 25 | 11 | 836 |
houses.describe()
| area | rooms | bathroom | parking spaces | hoa (R$) | rent amount (R$) | property tax (R$) | fire insurance (R$) | total (R$) | |
|---|---|---|---|---|---|---|---|---|---|
| count | 10692.000000 | 10692.000000 | 10692.000000 | 10692.000000 | 1.069200e+04 | 10692.000000 | 10692.000000 | 10692.000000 | 1.069200e+04 |
| mean | 149.217920 | 2.506079 | 2.236813 | 1.609147 | 1.174022e+03 | 3896.247194 | 366.704358 | 53.300879 | 5.490487e+03 |
| std | 537.016942 | 1.171266 | 1.407198 | 1.589521 | 1.559231e+04 | 3408.545518 | 3107.832321 | 47.768031 | 1.648473e+04 |
| min | 11.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000e+00 | 450.000000 | 0.000000 | 3.000000 | 4.990000e+02 |
| 25% | 56.000000 | 2.000000 | 1.000000 | 0.000000 | 1.700000e+02 | 1530.000000 | 38.000000 | 21.000000 | 2.061750e+03 |
| 50% | 90.000000 | 2.000000 | 2.000000 | 1.000000 | 5.600000e+02 | 2661.000000 | 125.000000 | 36.000000 | 3.581500e+03 |
| 75% | 182.000000 | 3.000000 | 3.000000 | 2.000000 | 1.237500e+03 | 5000.000000 | 375.000000 | 68.000000 | 6.768000e+03 |
| max | 46335.000000 | 13.000000 | 10.000000 | 12.000000 | 1.117000e+06 | 45000.000000 | 313700.000000 | 677.000000 | 1.120000e+06 |
Before we begin, it would be nice to convert the floor column to a numeric type, as well as correct the typo in the animal column. When converting the dtype of floor, we will assume that all - entries (which amount to $23%$ of the entries) are houses and that the remaining entries (where floor is specified) are buildings. We will use this assumption to create the type of real estate column.
houses.loc[:, "floor"] = pd.to_numeric(houses.loc[:, "floor"].replace({"-": "0"}))
houses.loc[:, "animal"] = houses.loc[:, "animal"].map({"acept": "accept", "not acept": "doesn't accept"})
houses.loc[:, "type of real estate"] = houses.loc[:, "floor"].copy().apply(lambda x: "house" if x == 0 else "building")
Now, houses must receive the latitude and longitude of the cities.
latitude_values = {
"Belo Horizonte": -19.8917,
"Campinas": -22.9009,
"Porto Alegre": -30.0328,
"Rio de Janeiro": -22.9083,
"São Paulo": -23.5504
}
longitude_values = {
"Belo Horizonte": -43.9478,
"Campinas": -47.0573,
"Porto Alegre": -51.2300,
"Rio de Janeiro": -43.1964,
"São Paulo": -46.6339
}
houses.loc[:, "latitude"] = houses.loc[:, "city"].copy().map(latitude_values)
houses.loc[:, "longitude"] = houses.loc[:, "city"].copy().map(longitude_values)
Finally, to ensure the best visualization results, a version of houses with no outliers in hoa (R$), rent amount (R$), property tax (R$), fire insurance (R$), and total (R$) will be produced and termed houses_no_outliers.
hoa = np.abs(stats.zscore(houses.loc[:, "hoa (R$)"]))
rent = np.abs(stats.zscore(houses.loc[:, "rent amount (R$)"]))
tax = np.abs(stats.zscore(houses.loc[:, "property tax (R$)"]))
insurance = np.abs(stats.zscore(houses.loc[:, "total (R$)"]))
houses_no_outliers = houses.drop(houses.loc[(hoa >= 3) | (rent >= 3) | (tax >= 3) | (insurance >= 3)].index)
Once the data is properly loaded and preprocessed, it is time to explore the data and answer the questions asked at the beginning of this notebook. Before we begin, let us load a topoJSON file to print Brazil in a map.
brazil_data = "https://gist.githubusercontent.com/ppKrauss/0c33364240e841fa23e78b21005f792c/raw/153641252fe5e6499f12373b0789565e822e5ef2/br-states.json"
brazil = alt.topo_feature(brazil_data, feature="estados")
To answer this question, it would be interesting to see a geoplot with circle marks whose size change based on how many listings each city has. In addition, an accompanying barplot would be pertinent to see more clearly how the listings are distributed across the cities.
map_background = alt.Chart(brazil).mark_geoshape(
fill="lightgray",
stroke="white",
)
highlight_map = alt.selection(type="single", on="mouseover", fields=["city"])
map_overlay = alt.Chart(houses).transform_aggregate(
latitude="mean(latitude)",
longitude="mean(longitude)",
count="count()",
groupby=["city"]
).mark_circle().encode(
longitude="longitude:Q",
latitude="latitude:Q",
size=alt.Size("count:Q", title="Listings", legend=alt.Legend(orient="none", legendX=10, legendY=260),),
# color=alt.value("steelblue"),
color=alt.condition(highlight_map, "city", alt.ColorValue("gray")),
tooltip=["city:N","count:Q"]
).properties(
width=400,
height=380
).add_selection(
highlight_map
)
listings_distribution = alt.Chart(houses).mark_bar().encode(
x=alt.X("city", axis=alt.Axis(title="City"), sort=alt.EncodingSortField(field="city", op="count", order="descending")),
y=alt.Y("count(city)", axis=alt.Axis(title="Listings")),
color=alt.condition(highlight_map, "city", alt.ColorValue("gray"), legend=None),
).properties(
width=400
).add_selection(
highlight_map
)
((map_background + map_overlay) | listings_distribution).configure_axisX(
labelAngle=-45
).properties(
title="Listings per city"
).configure_title(
anchor="middle",
)
It is noticeable from the map that São Paulo has more listings than the other cities—in fact, São Paulo alone has 5887 listings, whereas the listings of all other states combined amount to 4805 listings. Given how unbalanced this data set is and how few cities it describes, it is safe to assume that this data set is not representative of the real estate situation in the whole country. This is more clearly depicted in the bar plot above. Both visualizations are linked and interactive, allowing the user to quickly see where each city is spatially located in Brazil.
As said in Preprocessing the data, the entries with - (which were transformed to 0) were assumed to represent houses, whereas the remaining ones represent buildings. This assumption is not explicitly backed up by the data and their description, but it is regarded as pertinent since (most) houses are not customarily rented/sold by floor.
absolute_distribution_type_real_estate = alt.Chart(houses).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses.shape[0])))),
y=alt.Y("count()", title="Listings"),
color=alt.Color("type of real estate", title="Type")
).properties(
title="With outliers"
)
absolute_distribution_type_real_estate_no_outliers = alt.Chart(houses_no_outliers).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses_no_outliers.shape[0])))),
y=alt.Y("count()", title="Listings"),
color=alt.Color("type of real estate", title="Type")
).properties(
title="Without outliers"
)
relative_distribution_type_real_estate = alt.Chart(houses).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses.shape[0])))),
y=alt.Y("count()", title="Listings", stack="normalize"),
color=alt.Color("type of real estate", title="Type")
)
relative_distribution_type_real_estate_no_outliers = alt.Chart(houses_no_outliers).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses_no_outliers.shape[0])))),
y=alt.Y("count()", title="Listings", stack="normalize"),
color=alt.Color("type of real estate", title="Type")
)
(absolute_distribution_type_real_estate | absolute_distribution_type_real_estate_no_outliers) & (relative_distribution_type_real_estate | relative_distribution_type_real_estate_no_outliers)
Some aspects can be observed from the visualizations above. Firstly, employing houses_no_outliers instead of houses improves the visualization by removing extreme and mild outliers. Hence, the space is better used and the user can better perceive how the listings are distributed in terms of rent costs. Overall, we see that the majority of the listings are buildings, particularly in the bins with lower rent amounts. It can also be noticed that there is not a balanced sample of listings for each price range.
This difficults making assumptions, as the most of the expensive price ranges contain less than $200$ instances per bin. By normalizing the histogram bins, we can see a tendency of more house listings as the price range increases (proportionally speaking). It would be nice to have additional details on the surroundings of these listings to see if their location plays a significant role in the rent amount; however, since the data set only provides the city of the listing, this analysis cannot be made. To succinctly answer the question, there seems to be a bigger presence of houses in more expensive listings; however, overall listings of apartments are far more frequent.
One take at answering this question could be to use similar visualizations to those seen in the previous question. This time, only plots made with houses_no_outliers will be displayed, as it allows the user to better see how the data is distributed.
absolute_distribution_animal_no_outliers = alt.Chart(houses_no_outliers).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses_no_outliers.shape[0])))),
y=alt.Y("count()", title="Listings"),
color=alt.Color("animal", title="Accepts pets?")
).properties(
title="Absolute distribution"
)
relative_distribution_animal_no_outliers = alt.Chart(houses_no_outliers).mark_bar().encode(
x=alt.X("rent amount (R$)", title="Rent amount (R$)", bin=alt.Bin(maxbins=np.round(np.sqrt(houses_no_outliers.shape[0])))),
y=alt.Y("count()", title="Listings", stack="normalize"),
color=alt.Color("animal", title="Accepts pets?")
).properties(
title="Normalized distribution"
)
absolute_distribution_animal_no_outliers | relative_distribution_animal_no_outliers
These plots enable us to see if there is a change in the rate of acceptance/rejection of animals as the rent amount increases, thus allowing us to ascertain if the rent cost has an influence in pet acceptance. This time around, the proportions of listings that accept pets or not in the normalized stacked bar plot are more constant. In the visualizations that were used to answer the previous question, there was a more visible tendency of growth for house listings as the rent prices went up. Here, however, this tendency does not seem as present. Yet again, the fact that quite some of the most expensive price ranges is represented by fewer than $200$ instances means that the actual distributions could vary significantly. The conclusion here is that, regardless of income, the majority of listings accepts animals.
This analysis could be complemented by also considering other factors: does the city significantly influences pet acceptance? What about if the real estate is a house or a building?
type_selector = alt.binding_select(options=[None, "building", "house"], labels=["All", "Building", "House"], name="Type of real estate: ")
selection = alt.selection_single(fields=["type of real estate"], bind=type_selector)
stripplot_pets = alt.Chart(houses_no_outliers[houses_no_outliers.loc[:, "animal"] == "accept"], width=100).mark_circle(size=8).encode(
x=alt.X(
"jitter:Q",
title=None,
axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
scale=alt.Scale(),
),
y=alt.Y(
"rent amount (R$)",
title="Rent amount (R$)",
scale=alt.Scale(domain=(0, 16000))
),
# color=alt.Color("city:N", legend=None),
color=alt.condition(
selection,
alt.Color(
'city:N',
legend=None
),
alt.value('')
# alt.value('lightgray')
),
column=alt.Column(
"city:N",
header=alt.Header(
labelAngle=0,
titleOrient="top",
labelOrient="bottom",
labelAlign="center"
),
title=None
),
).transform_calculate(
# Generate Gaussian jitter with a Box-Muller transform
jitter="2*sqrt(-log(random()))*cos(2*PI*random())"
).properties(
title="Accept pets"
).add_selection(
selection
)
stripplot_no_pets = alt.Chart(houses_no_outliers.loc[houses_no_outliers.loc[:, "animal"] == "doesn't accept"], width=100).mark_circle(size=8).encode(
x=alt.X(
"jitter:Q",
title=None,
axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
scale=alt.Scale(),
),
y=alt.Y(
"rent amount (R$)",
# title="Rent amount (R$)",
title=None,
scale=alt.Scale(domain=(0, 16000))
),
# color=alt.Color("city:N", legend=None),
color=alt.condition(
selection,
alt.Color(
'city:N',
legend=None
),
alt.value('')
# alt.value('lightgray')
),
column=alt.Column(
"city:N",
header=alt.Header(
labelAngle=0,
titleOrient="top",
labelOrient="bottom",
labelAlign="center"
),
title=None
),
).transform_calculate(
# Generate Gaussian jitter with a Box-Muller transform
jitter="2*sqrt(-log(random()))*cos(2*PI*random())"
).properties(
title="Do not accept pets"
).add_selection(
selection
)
(stripplot_pets | stripplot_no_pets).configure_title(
anchor="middle"
).configure_facet(
spacing=0
).configure_view(
stroke=None
)
These visualizations allow us to see which how the listings are distributed by city in terms of accepting pets or not with observance to the rent amount. In addition, these plots can be filtered by real estate type. In other words, four attributes are considered in these plots: rent amount, city, type of real estate, and acceptance of pets.
They show a similar trend to what was observed in the histograms: regardless of income, real estate type, and city, the majority of listings have no objection to animals. To answer the proposed question, it is relatively safe to assume that there is not a feature in this data set that is strongly correlated with the acceptance or prohibition of pets. As such, more information would be needed to further investigate this matter.